This is a project of the course Introduction to Open Data Science.
I heard about this course in one of the emails for doctoral students. I’m hoping to:
Here is the link to my GitHub repository.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Fri Dec 04 01:28:58 2020"
This chapter describes an application of a linear regression model.
The data analyzed in this chapter is a subset of the international survey of Approaches to Learning data set taken from here. Variable descriptions and other metadata of this data set can be found by following this link.
The subset analyzed in this chapter can be found on my GitHub page along with the R code used to derive it.
The first task is to read the data and check its structure. This is done using the R code below.
# Reading the data file
learning2014 <- read.table("data/learning2014.csv", sep=",", header=TRUE)
# Looking at the dimensions and structure of the data
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The data consists of 166 observations of 7 variables. These variables are:
The second task is to show a graphical overview of the data and show summaries of the variables.
For drawing plots, the ggplot2 library will be used.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.3
A summary of the variables can be obtained using the summary() function below.
# Looking at summary statistics of the variables
summary(learning2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
The summary table shows the minimum, maximum, mean, median, and 1st and 3rd quartile values of each numeric variable. The age of the respondents ranges from 17 to 55 years with a median of 22 years. Exam points vary from 7 to 33 with a median of 23 points. The attitude and learning strategy variables all average around 3.
In the next step, various plots are used for visualizing the variables and their relationships.
# Making a table of how many responses are from males and how many from females
table_gender <- table(learning2014$gender)
# Visualizing result with a simple barplot
ggplot(data=as.data.frame(table_gender), aes(x=Var1,y=Freq)) +
geom_bar(stat="identity", fill="darkblue")+ theme_minimal()+
labs(x = "Gender", y = 'Count')+ ggtitle("Students' gender")
The number of female respondents is about double the number of male respondents.
# Plotting a histogram of the age variable
ggplot(learning2014, aes(x=age)) +theme_minimal() + geom_histogram(color="darkblue", fill="lightblue", binwidth=5) + ggtitle("Histogram of students' age")
Most of the respondents are around age 20.
# Plotting a histogram of the attitude variable
ggplot(learning2014, aes(x=attitude)) +theme_minimal() + geom_histogram(color="darkblue", fill="lightblue", binwidth=0.5) + ggtitle("Histogram of students' attitude")+ stat_function(fun = function(x) dnorm(x, mean = mean(learning2014$attitude), sd = sd(learning2014$attitude)) * 0.5 * nrow(learning2014))
Students’ attitude seems to be approximately normally distributed.
# Visualizing attitude vs points, coloring the points by gender and adding regression lines
ggplot(learning2014, aes(x = attitude, y = points, col = gender))+ geom_point()+geom_smooth(method = "lm")+ ggtitle("Student's attitude versus exam points") +theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
From the last plot, it can be seen that exam points tend to increase with a better attitude.
# Drawing a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(learning2014[-1])
Pairwise scatter plots of the variables do not show very clear linear relationships. The most noticeable linear relationship is between the exam points and the attitude, as in the previous plot.
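The visual impression can be checked numerically by computing the correlation of each variable with the exam points (a quick sketch, using the learning2014 data frame loaded above):

```r
# Correlation of each numeric variable with exam points; [-1] drops gender
round(cor(learning2014[-1])[, "points"], 2)
```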
The third and fourth tasks are to fit the regression model and explain it.
Linear regression is an approach for modeling the relationship between a dependent variable \(y\) and one or more explanatory variables \(X\).
I chose the attitude, deep and strategic learning variables as explanatory variables for the dependent points variable. In this case the model has the form:
\(points = x_0 + x_1*attitude + x_2*deep + x_3*stra + e,\) where \(e\) is an unobservable random variable which is assumed to add noise to the observations.
In the code below a linear model is fitted to estimate the coefficients \(x_0, x_1, x_2, x_3\).
# Fitting a linear model
my_model <- lm(points ~ attitude + deep + stra, data = learning2014)
# Printing out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + deep + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## deep -0.7492 0.7507 -0.998 0.31974
## stra 0.9621 0.5367 1.793 0.07489 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
When the p-value is less than the chosen significance level, we can reject the null hypothesis that the coefficient of the model’s variable is equal to 0, meaning that the variable adds significant explanatory value to the model. In the model above, only the intercept and attitude have p-values smaller than the significance level 0.05. Therefore, I removed the deep and stra variables and built a second model including only the attitude variable, which has a significant relationship with the target variable.
# Fitting second linear model
my_model2 <- lm(points ~ attitude, data = learning2014)
# Printing out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Below is the code for a fitted line plot, which shows a scatter plot of the data with the regression line representing the regression equation.
# Plotting the regression line given by the model
ggplot(learning2014, aes(x = attitude, y = points)) +
geom_point() +
stat_smooth(method = "lm", col = "red") +ggtitle("Fitted line plot") +theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
The resulting model has the form
\(points = 11.64 + 3.53*attitude\).
The attitude coefficient in the regression equation is 3.53. This coefficient represents the mean increase in exam points for every additional unit in the attitude measurement. That is, if the attitude value increases by 1, the predicted exam points increase by 3.53.
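As a worked example with the rounded coefficients, consider two hypothetical students with attitude scores 3 and 4:

\(points = 11.64 + 3.53*3 = 22.23\) and \(points = 11.64 + 3.53*4 = 25.76,\)

a difference of exactly the attitude coefficient, 3.53.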
R-squared is a measure of how well a linear regression model fits the data. It indicates the percentage of the variance in the dependent variable that the independent variables explain collectively. In this case the only independent variable is attitude, and based on the R-squared value in the summary table, it explains about 19% of the variance in the points variable. When the deep and stra variables are included, the R-squared value is slightly larger, showing that about 21% of the variability in points is explained.
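As a sanity check, the R-squared value can also be computed directly from the residuals of the my_model2 object fitted above (a quick sketch):

```r
# R^2 = 1 - SS_residual / SS_total; should match the Multiple R-squared
ss_res <- sum(residuals(my_model2)^2)
ss_tot <- sum((learning2014$points - mean(learning2014$points))^2)
1 - ss_res / ss_tot  # approximately 0.19
```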
The last task is to produce diagnostic plots.
How well the model describes the phenomenon depends on how well the model assumptions fit reality. In a linear regression model the main assumptions are that the relationship is linear and that the errors are normally distributed. Analyzing the residuals of the model provides a way to explore the validity of these assumptions. A good tool for such analysis is diagnostic plots, which let us examine the normality of the errors, the constancy of the error variance, and the leverage of individual observations.
Below is the code for diagnostic plots.
# Drawing the diagnostic plots.
par(mfrow = c(2,2))
plot(my_model2, which = c(1,2,5))
The first plot, residuals vs fitted values, shows no visible pattern, so the assumption of constant error variance is not violated.
The second plot, the normal Q-Q plot, shows a reasonable fit to the line, which supports the assumption that the errors are approximately normally distributed.
The third plot, residuals vs leverage, shows how much impact a single observation has on the model. It does not look like any observation has an unusually high impact.
Even though there is a significant linear relationship between attitude and exam points, the attitude variable explains only about 19% of the variability of exam points. Therefore, the model does not seem like a very good fit for the data.
date()
## [1] "Fri Dec 04 01:29:03 2020"
This chapter describes an application of a logistic regression model.
The data of this chapter is the student alcohol consumption data from the UCI Machine Learning Repository. It has been pre-processed using the code in my GitHub repository. The data set consists of 35 variables and 370 observations. Information about the variables can be found here. An additional alc_use variable was defined by taking the average of weekday and weekend alcohol use, and an additional high_use variable was defined by assigning the value TRUE if alc_use is higher than 2 and FALSE otherwise.
# Reading the data
alc <- read.csv('data/alc.csv')
# Printing out the names of the variables in the data
names(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
# Looking at the dimensions of the data
dim(alc)
## [1] 370 35
# Looking at the structure of the data
str(alc)
## 'data.frame': 370 obs. of 35 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 15 15 15 15 15 15 15 15 15 15 ...
## $ address : chr "R" "R" "R" "R" ...
## $ famsize : chr "GT3" "GT3" "GT3" "GT3" ...
## $ Pstatus : chr "T" "T" "T" "T" ...
## $ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : chr "at_home" "other" "at_home" "services" ...
## $ Fjob : chr "other" "other" "other" "health" ...
## $ reason : chr "home" "reputation" "reputation" "course" ...
## $ guardian : chr "mother" "mother" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : chr "yes" "yes" "yes" "yes" ...
## $ famsup : chr "yes" "yes" "yes" "yes" ...
## $ activities: chr "yes" "no" "yes" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "yes" "yes" "no" "yes" ...
## $ romantic : chr "no" "yes" "no" "no" ...
## $ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : int 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
## $ health : int 1 5 2 5 3 5 5 4 3 4 ...
## $ failures : int 0 1 0 0 1 0 1 0 0 0 ...
## $ paid : chr "yes" "no" "no" "no" ...
## $ absences : int 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : int 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : int 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : int 12 8 12 9 12 12 6 10 16 10 ...
## $ alc_use : int 1 3 1 1 2 1 2 2 2 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
I think that these four variables could be related to alcohol consumption:
# Accessing the tidyverse libraries dplyr and ggplot2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Let’s look at the distribution of the sex variable.
#Producing summary statistics of alcohol use
alc %>% group_by(sex, high_use) %>% summarise(count = n(),mean_alc_use = mean(alc_use))
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_alc_use
## <chr> <lgl> <int> <dbl>
## 1 F FALSE 154 1.44
## 2 F TRUE 41 2.54
## 3 M FALSE 105 1.50
## 4 M TRUE 70 3.36
It seems that more males have high alcohol consumption compared to females. My initial hypothesis was right.
Let’s look at the distribution of the goout variable.
# Initializing a plot of alc_use and going out
g2 <- ggplot(alc, aes(y = alc_use, x = factor(goout), col = sex))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student tendency to go out by alcohol consumption and sex") + theme_classic() + xlab('Going out') + ylab('Alcohol use')
It seems that students who go out more also tend to drink more alcohol, so my initial hypothesis was correct. Interestingly, for males the median alcohol consumption increases drastically if they go out with friends a lot, whereas for females the median does not change much between going out a moderate and a high number of times.
Let’s look at the famrel variable.
# Initializing a plot of alc_use and family relationships
g2 <- ggplot(alc, aes(x = factor(famrel), y = alc_use, col = sex))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student's family relationships by alcohol consumption and sex") + xlab('Family relationships') + ylab('Alcohol consumption') + theme_light()
My initial hypothesis that family relationships are related to alcohol consumption appears correct. For males it is clearly visible that the worse the family relationships, the higher the alcohol consumption. For females, the median alcohol consumption is similar across levels, except when family relationships are very good, in which case consumption drops.
Let’s look at the studytime variable.
# Initializing a plot of alc_use and study time
g2 <- ggplot(alc, aes(x = factor(studytime), y = alc_use, col = sex))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ggtitle("Student's study time by alcohol consumption and sex") + theme_dark() + xlab('Study time') + ylab('Alcohol use')
My initial hypothesis was correct: for both males and females, alcohol consumption drops as study time increases.
In this section a logistic regression model will be fitted to explore the relationship between my chosen variables and the binary high/low alcohol consumption variable as the target variable.
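Denoting the probability of high alcohol consumption by \(p\), the logistic regression model fitted below has the standard form

\(\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,\)

where the \(x_i\) are the (dummy-coded) explanatory variables and the coefficients \(\beta_i\) are on the log-odds scale.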
# Subsetting initial data frame to include only chosen variables and converting them to factors
alc_subset <- alc %>% select(high_use,sex,goout,famrel,studytime) %>% mutate(high_use = factor(high_use),
sex = factor(sex),
goout = factor(goout),
famrel = factor(famrel),
studytime = factor(studytime))
str(alc_subset)
## 'data.frame': 370 obs. of 5 variables:
## $ high_use : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 1 2 1 1 1 2 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ goout : Factor w/ 5 levels "1","2","3","4",..: 2 4 1 2 1 2 2 3 2 3 ...
## $ famrel : Factor w/ 5 levels "1","2","3","4",..: 3 3 4 4 4 4 4 4 4 4 ...
## $ studytime: Factor w/ 4 levels "1","2","3","4": 4 2 1 3 3 3 3 2 4 4 ...
# Finding the model with glm()
m <- glm(high_use ~ sex + goout + famrel + studytime, data = alc_subset, family = "binomial")
# Printing out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ sex + goout + famrel + studytime, family = "binomial",
## data = alc_subset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6339 -0.7259 -0.4807 0.7514 2.5431
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.62942 1.09247 -1.491 0.135833
## sexM 0.87854 0.27963 3.142 0.001679 **
## goout2 0.24560 0.69914 0.351 0.725369
## goout3 0.59346 0.68118 0.871 0.383636
## goout4 2.03442 0.68010 2.991 0.002778 **
## goout5 2.47951 0.70339 3.525 0.000423 ***
## famrel2 0.09062 1.05603 0.086 0.931619
## famrel3 0.17934 0.94055 0.191 0.848782
## famrel4 -0.51821 0.91743 -0.565 0.572171
## famrel5 -1.04992 0.94231 -1.114 0.265193
## studytime2 -0.34253 0.30894 -1.109 0.267537
## studytime3 -0.89382 0.48761 -1.833 0.066796 .
## studytime4 -1.29146 0.65058 -1.985 0.047136 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 363.90 on 357 degrees of freedom
## AIC: 389.9
##
## Number of Fisher Scoring iterations: 5
As all the variables in this model are factors, each of them is converted into so-called dummy variables. The summary of the model shows that famrel is the only variable for which none of the dummy-coded levels has a statistically significant relationship with high alcohol consumption; for every other variable, at least one dummy level is statistically significant.
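The dummy coding used by glm() can be inspected with model.matrix(); each factor level except the baseline (first) level gets its own 0/1 indicator column (a quick sketch):

```r
# First rows of the design matrix that glm() builds internally
head(model.matrix(high_use ~ sex + goout + famrel + studytime, data = alc_subset))
```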
# Printing out the coefficients of the model
coef(m)
## (Intercept) sexM goout2 goout3 goout4 goout5
## -1.62941587 0.87853522 0.24560330 0.59345587 2.03442345 2.47950909
## famrel2 famrel3 famrel4 famrel5 studytime2 studytime3
## 0.09061581 0.17933751 -0.51821396 -1.04991854 -0.34253287 -0.89381785
## studytime4
## -1.29145540
The positive sexM coefficient means that the log-odds of high alcohol consumption increase by 0.88 when the person is male. The positive goout2, goout3, goout4 and goout5 coefficients mean that the log-odds of high alcohol consumption increase with how often a person goes out, by 0.25, 0.59, 2.03 and 2.48 respectively, relative to the goout1 baseline. Depending on the family relationships, the log-odds of high alcohol consumption decrease when relationships are good and increase otherwise. The more study time increases, the more the log-odds of high alcohol consumption decrease. The model thus seems to confirm my earlier hypotheses.
# Computing odds ratios (OR)
OR <- coef(m) %>% exp
# Computing confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# Printing out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1960441 0.01756286 1.4112813
## sexM 2.4073708 1.39914040 4.1989737
## goout2 1.2783923 0.35906327 6.0626843
## goout3 1.8102336 0.53279909 8.3774038
## goout4 7.6478415 2.26873938 35.4135478
## goout5 11.9354038 3.36543311 57.2822325
## famrel2 1.0948483 0.15189997 10.7759938
## famrel3 1.1964245 0.21464931 9.9233767
## famrel4 0.5955833 0.11227629 4.7645467
## famrel5 0.3499663 0.06201071 2.8900970
## studytime2 0.7099698 0.38706568 1.3037798
## studytime3 0.4090909 0.15006032 1.0324002
## studytime4 0.2748704 0.06709430 0.9039728
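These odds ratios are simply the exponentiated coefficients; for example, for the sexM variable

\(OR_{sexM} = e^{0.8785} \approx 2.41,\)

meaning that the odds of high alcohol consumption for males are about 2.4 times the odds for females.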
sexM, all of the goout dummy variables, and famrel2 as well as famrel3 have OR > 1, which means greater odds of high alcohol consumption relative to the corresponding baseline levels.
As the famrel variable does not show statistical significance, let’s exclude it from the model.
# Fitting the model with glm()
m <- glm(high_use ~ sex + goout + studytime, data = alc_subset, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# Adding the predicted probabilities to 'alc_subset'
alc_subset <- mutate(alc_subset, probability = probabilities)
# Using the probabilities to make a prediction of high_use
alc_subset <- mutate(alc_subset, prediction = probability > 0.5)
# Looking at the last ten original classes, predicted probabilities, and class predictions
tail(alc_subset, 10)
## high_use sex goout famrel studytime probability prediction
## 361 TRUE M 3 5 1 0.3531731 FALSE
## 362 TRUE M 3 4 1 0.3531731 FALSE
## 363 TRUE M 1 4 1 0.2321607 FALSE
## 364 TRUE M 4 3 2 0.5923356 TRUE
## 365 FALSE M 2 3 1 0.2877723 FALSE
## 366 TRUE M 3 4 1 0.3531731 FALSE
## 367 FALSE M 2 4 3 0.1213033 FALSE
## 368 TRUE M 4 4 1 0.6891062 TRUE
## 369 TRUE M 4 5 2 0.5923356 TRUE
## 370 FALSE M 2 4 1 0.2877723 FALSE
# Tabulating the target variable versus the predictions
table(high_use = alc_subset$high_use, prediction = alc_subset$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 240 19
## TRUE 58 53
The table of the target variable versus the predictions shows that most of the students with low alcohol consumption are predicted correctly, whereas roughly half of the students with high alcohol consumption are predicted correctly and half incorrectly.
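From this table the overall accuracy and the class-specific rates can be computed directly:

\(accuracy = \frac{240+53}{370} \approx 0.79\), \(sensitivity = \frac{53}{58+53} \approx 0.48\), \(specificity = \frac{240}{240+19} \approx 0.93\).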
# Initializing a plot of 'high_use' versus 'probability' in 'alc_subset'
g <- ggplot(alc_subset, aes(x = probability, y = high_use, col = prediction))
# Define the geom as points and draw the plot
g + geom_point()
The plot illustrates that many of the students with high alcohol consumption are not predicted correctly.
# Define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# Calling loss_func to compute the average number of wrong predictions in the (training) data
# alc_subset$high_use is a factor, so it needs to be converted into a numeric vector.
# 1 is subtracted because as.numeric() converts TRUE values to 2 and FALSE values to 1
loss_func(class = as.numeric(alc_subset$high_use)-1,prob = alc_subset$probability)
## [1] 0.2081081
Now let’s check whether the model predicts better than random guessing.
# Instead of the probabilities predicted by the model, a random sample of 0's and 1's is generated by the sample() function
loss_func(class = as.numeric(alc_subset$high_use)-1,prob = sample(0:1, nrow(alc_subset), replace=T))
## [1] 0.4945946
The logistic regression model has a training error of about 21%, whereas random guessing gives an error of around 50%, which means that the logistic regression model is reasonably informative, with a clearly lower error than random guessing.
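Another simple baseline is to always predict the majority class FALSE; its error equals the share of high-use students (a quick sketch):

```r
# Error of always predicting FALSE = proportion of TRUE cases in high_use
mean(alc_subset$high_use == "TRUE")  # (58 + 53) / 370 = 0.3
```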
Let’s perform 10-fold cross-validation.
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc_subset, cost = loss_func, glmfit = m, K = 10)
# Average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.227027
My model has a slightly smaller prediction error (about 0.23) in 10-fold cross-validation compared to the model introduced in DataCamp (which had an error of about 0.26).
date()
## [1] "Fri Dec 04 01:29:07 2020"
In this chapter the Boston data from the MASS package is analyzed.
Let’s load and explore the data.
# Accessing the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Loading the Boston data
data("Boston")
# Looking at the structure of the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
This data frame contains 506 observations of 14 variables. These variables are:
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
medv: median value of owner-occupied homes in $1000s.
lstat: lower status of the population (percent).
Let’s look at the summaries of the variables in the data.
# Looking at summary statistics
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
All of the variables have different scales. For example, rm varies from 3.5 to 8.7, while tax varies from 187 to 711. This shows that the data will need to be standardized before clustering, as the features have large differences in their ranges. Since clustering methods are distance-based, features with larger ranges would otherwise dominate the result.
Let’s look at the pairwise relationships between the variables.
# Plotting a matrix of the variables
pairs(Boston)
There are some visible linear relationships between the variables. For example, as medv (median value of owner-occupied homes in $1000s) increases, rm (average number of rooms per dwelling) seems to increase as well.
Let’s look at the correlations.
# Accessing corrplot and tidyr libraries
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.3
## corrplot 0.84 loaded
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.0.3
# Loading package for color palletes
library(wesanderson)
## Warning: package 'wesanderson' was built under R version 4.0.3
# Calculating the correlation matrix and rounding it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# Printing the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# Visualizing the correlation matrix
corrplot.mixed(cor_matrix, upper.col = wes_palette("Rushmore1", 100, type = "continuous"), lower.col = wes_palette("Rushmore1", 100, type = "continuous"), tl.col= 'black', number.cex = .45, tl.pos = "d", tl.cex = 0.6 )
The highest positive correlation (0.91) is between the tax and rad variables. The strongest negative correlation (-0.77) is between the dis and nox variables.
Let’s standardize the data by subtracting the column means from the corresponding columns and dividing the differences by the column standard deviations.
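In symbols, each column \(x\) is transformed as

\(x_{scaled} = \frac{x - \bar{x}}{s_x},\)

where \(\bar{x}\) is the column mean and \(s_x\) the column standard deviation; this is what the scale() function does.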
# Scaling the variables
boston_scaled <- scale(Boston)
# Printing summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
As scale() returns a matrix, the scaled data needs to be converted into a data frame object.
# Change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
class(boston_scaled)
## [1] "data.frame"
Let’s create a categorical variable of the crime rate in the Boston dataset.
# Using quantiles as break points
break_points <- quantile(boston_scaled$crim)
break_points
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# Creating a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = break_points, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
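As a quick sanity check (a sketch, assuming the crime factor created above), the quantile break points should produce four roughly equal-sized classes:

```r
# Each class should hold roughly a quarter of the 506 observations
table(crime)
prop.table(table(crime))
```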
Let’s drop the old crime rate variable from the dataset.
# Removing original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# Adding the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Let’s divide data into train and test sets.
# Setting the number of rows in the Boston dataset
n <- nrow(boston_scaled)
# Choosing randomly 80% of the rows
set.seed(123)
ind <- sample(n, size = n * 0.8)
# Creating a train set
train <- boston_scaled[ind,]
# Creating a test set
test <- boston_scaled[-ind,]
In this section LDA will be fitted using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.
# Fitting LDA
lda.fit <- lda(crime ~ ., data = train)
# Printing the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2500000 0.2500000 0.2450495
##
## Group means:
## zn indus chas nox rm age
## low 1.01866313 -0.9066422 -0.08120770 -0.8835724 0.38666334 -0.9213895
## med_low -0.07141687 -0.3429155 0.03952046 -0.5768545 -0.09882533 -0.3254849
## med_high -0.40148591 0.2162741 0.19544522 0.3637157 0.12480815 0.4564260
## high -0.48724019 1.0149946 -0.03371693 1.0481437 -0.47733231 0.8274496
## dis rad tax ptratio black lstat
## low 0.9094324 -0.6819317 -0.7458486 -0.4234280 0.3729083 -0.766883775
## med_low 0.3694282 -0.5395408 -0.4205644 -0.1079710 0.3103406 -0.164921798
## med_high -0.3720478 -0.4349280 -0.3191090 -0.2716959 0.1049654 0.009720124
## high -0.8601246 1.6596029 1.5294129 0.8057784 -0.6383964 0.900379309
## medv
## low 0.47009410
## med_low 0.01548761
## med_high 0.19440747
## high -0.71294711
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.148390645 0.74870532 -1.0874785
## indus 0.040971465 -0.38126652 -0.1619456
## chas 0.002460776 0.03963849 0.1699312
## nox 0.312458150 -0.67564471 -1.3104018
## rm 0.011086947 -0.16058718 -0.1572603
## age 0.283482486 -0.38817624 -0.1013491
## dis -0.079848789 -0.38493775 0.2108038
## rad 3.718978412 0.93123532 -0.4706522
## tax -0.015197127 0.04230505 1.2889075
## ptratio 0.180294774 -0.01592588 -0.3558490
## black -0.136724112 0.02948075 0.1288959
## lstat 0.145739238 -0.37823065 0.3345688
## medv 0.061327205 -0.53906503 -0.1509890
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9523 0.0364 0.0113
# The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# Setting target classes as numeric
classes <- as.numeric(train$crime)
# Plotting the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
From the LDA biplot it can be seen that the rad variable is the most influential linear separator for discriminating high crime rates, although some medium-high observations are mixed in with that group.
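The dominance of rad can also be checked numerically from the discriminant coefficients; a minimal sketch, assuming the lda.fit object fitted above:

```r
# Absolute LD1 coefficients, largest first: rad should stand out clearly
sort(abs(lda.fit$scaling[, "LD1"]), decreasing = TRUE)
```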
Let’s save the crime categories from the test set and then remove it from the test dataset.
# Saving the correct classes from test data
correct_classes <- test$crime
# Removing the crime variable from test data
test <- dplyr::select(test, -crime)
Now let's make predictions with the LDA model on the test set and cross tabulate the results.
# Predicting the classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# Cross tabulating the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 10 10 4 0
## med_low 2 17 6 0
## med_high 1 9 13 2
## high 0 0 1 27
The model predicts high crime rates best: 27 of the 28 observations in the high class were classified correctly. Separating the low, medium-low and medium-high crime rates is much harder for the model.
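The same conclusion can be quantified from the cross tabulation; a sketch assuming correct_classes and lda.pred from above:

```r
conf_mat <- table(correct = correct_classes, predicted = lda.pred$class)
# Overall share of correct predictions
sum(diag(conf_mat)) / sum(conf_mat)
# Per-class recall: the high class should be close to 1
diag(conf_mat) / rowSums(conf_mat)
```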
Let's reload and scale the Boston data set.
# Scaling the variables in Boston data set
boston_scaled2 <- scale(Boston)
Let's calculate the Euclidean distances between the observations.
# Distances between the observations
dist_eu <- dist(boston_scaled2)
Let’s run k-means algorithm on the dataset choosing 2 clusters.
# k-means clustering
km <-kmeans(boston_scaled2, centers = 2)
# Plotting the Boston dataset with clusters
pairs(Boston, col = km$cluster)
Let's investigate the optimal number of clusters and run the algorithm again.
library(ggplot2)
# Setting the seed so that the initial clusters are assigned the same way every time
set.seed(123)
# Determining the number of clusters
k_max <- 20
# Calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# Visualizing the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
It is hard to tell the optimal number of clusters, as there is no sharp drop and the curve resembles exponential decay. It is perhaps around 5-6.
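One simple heuristic for reading the curve (a sketch, assuming the twcss vector computed above) is to look at the relative drop in TWCSS for each added cluster and pick the k where the drops level off:

```r
# Relative decrease in total within-cluster sum of squares per added cluster
rel_drop <- -diff(twcss) / head(twcss, -1)
round(rel_drop, 3)
```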
Let’s run kmeans again with 6 clusters.
# k-means clustering
km <-kmeans(boston_scaled2, centers = 6)
# Plotting the Boston dataset with clusters
pairs(Boston, col = km$cluster)
Plotting the clusters in the data with ggpairs.
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.3
options(warn = -1)
ggpairs(Boston, mapping = aes(color = factor(km$cluster)))
Some clusters seem to be further apart from the others. It is hard to draw further conclusions from such plots.
Next, let's perform k-means on the scaled Boston data with a reasonable number of clusters and use the clusters as LDA target classes.
# Scaling the variables
boston_scaled3 <- scale(Boston)
# Performing kmeans on scaled Boston data
km <-kmeans(boston_scaled3, centers = 6)
# Including clusters into scaled boston dataset
boston_clusters <- data.frame(boston_scaled3, km = km$cluster)
str(boston_clusters)
## 'data.frame': 506 obs. of 15 variables:
## $ crim : num -0.419 -0.417 -0.417 -0.416 -0.412 ...
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ km : int 2 2 6 2 6 2 2 2 2 2 ...
# Performing LDA using the clusters as target classes
lda.fit.cluster <- lda(km ~ ., data = boston_clusters)
# The function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# Setting target classes as numeric
classes <- as.numeric(boston_clusters$km)
# Plotting the lda results
plot(lda.fit.cluster, dimen = 2, col = classes, pch = classes)
# Plotting arrows representing the relationships of the original variables to the LDA solution
lda.arrows(lda.fit.cluster, myscale = 1)
There seem to be two clusters that lie farther from the rest. That distance is mainly driven by the rad variable.
model_predictors <- dplyr::select(train, -crime)
# Checking the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# Matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#install.packages('plotly')
library(plotly)
# Plotting with color argument equal to crime classes
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# Plotting with color argument equal to kmeans clusters of the observations in train set of LDA.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = factor(km$cluster[ind]))
These two plots are similar in that one group of points lies further apart from the rest of the data points. The difference is that in the first plot the separated group is less mixed and mainly includes points of a single class.
I wonder how similar they would be if 4 cluster centers were chosen instead of 6.
# Scaling the variables
boston_scaled3 <- scale(Boston)
# Performing kmeans on scaled Boston data
km <-kmeans(boston_scaled3, centers = 4)
# Plotting with color argument equal to kmeans clusters of the observations in train set of LDA.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = factor(km$cluster[ind]))
It seems that the distant cluster still remains less homogeneous than in the first plot colored by crime rate classes.
date()
## [1] "Fri Dec 04 01:29:56 2020"
In this chapter the human dataset, which originates from the United Nations Development Programme, is analyzed. The code of data preparation along with the data is in my Github repository.
Let’s load and explore the data.
# Loading the data
human <- read.table('data/human.csv',sep = ',',header = T, row.names = 1)
# Looking at the structure of the dataset
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim (human)
## [1] 155 8
names(human)
## [1] "Edu2.FM" "Labo.FM" "Life.Exp" "Edu.Exp" "GNI" "Mat.Mor"
## [7] "Ado.Birth" "Parli.F"
The data set consists of 155 observations and 8 variables. Each observation has a name which indicates a country. The variables are:
Edu2.FM: ratio of the proportion of females with at least secondary education to the proportion of males with at least secondary education.
Labo.FM: ratio of the proportion of females in the labour force to the proportion of males in the labour force.
Life.Exp: life expectancy at birth.
Edu.Exp: expected years of schooling.
GNI: Gross National Income per capita.
Mat.Mor: maternal mortality ratio.
Ado.Birth: adolescent birth rate.
Parli.F: percentage of female representatives in parliament.
# Looking at the summary of the variables
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
All of the variables are on different scales, and some features, especially GNI, Mat.Mor and Ado.Birth, exhibit high variability.
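The differing scales can be verified directly; a minimal sketch, assuming the human data frame loaded above:

```r
# Standard deviations on the original scales: GNI dominates by far
sapply(human, sd)
```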
Let’s look at the pairwise relationships between the variables.
# Plotting a matrix of the variables
pairs(human)
library(corrplot)
library(dplyr)
# Computing the correlation matrix
cor(human) %>% corrplot(type = "upper", tl.cex=0.7)
Many variables seem to be correlated. For example, when Life.Exp (life expectancy at birth) increases, Edu.Exp (expected years of schooling) and GNI increase as well. With increasing Life.Exp, Mat.Mor (maternal mortality ratio) and Ado.Birth decreases (adolescent birth rate).
Let’s look at the distribution of the variables.
library(ggplot2)
# Plotting a histogram of the life expectancy at birth variable
ggplot(human, aes(x=Life.Exp)) +theme_minimal() + geom_histogram(color="darkblue", fill="lightblue", binwidth=5) + ggtitle("Life expectancy at birth")
library(ggplot2)
# Plotting a histogram of the expected years of schooling variable with a normal curve overlay
ggplot(human, aes(x=Edu.Exp)) +theme_minimal() + geom_histogram(color="darkblue", fill="lightblue", binwidth=1) + ggtitle("Expected years of schooling")+ stat_function(fun = function(x) dnorm(x, mean = mean(human$Edu.Exp), sd = sd(human$Edu.Exp)) * 1 * nrow(human))
library(ggplot2)
# Plotting a histogram of the Mat.Mor variable
ggplot(human, aes(x=Mat.Mor)) +theme_minimal() + geom_histogram(color="darkblue", fill="lightblue", binwidth=50) + ggtitle("Maternal mortality ratio")
library(ggplot2)
# Plotting a histogram of the Ado.Birth variable
ggplot(human, aes(x=Ado.Birth)) +theme_minimal() + geom_histogram(color="darkblue", fill="lightblue", binwidth=10) + ggtitle("Adolescent birth rate")
library(ggplot2)
# Plotting a histogram of the Parli.F variable
ggplot(human, aes(x=Parli.F)) +theme_minimal() + geom_histogram(color="darkblue", fill="lightblue", binwidth=5) + ggtitle("Percentage of female representatives in parliament")
Of all the distributions, only expected years of schooling resembles a normal distribution. Maternal mortality ratio is mostly quite low, but there are a few very high exceptions. Life expectancy has the most counts between 70 and 80 years. In very few countries do females hold more than 50% of the parliament seats.
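The visual impression for Edu.Exp could be backed up with a formal normality test; a sketch using the Shapiro-Wilk test (one common choice, not part of the original analysis):

```r
# A large p-value means no evidence against normality
shapiro.test(human$Edu.Exp)
```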
In this section PCA is applied to both non-standardized and standardized data, and the results are compared at the end of the section.
options(warn = -1)
# Performing PCA (with the SVD method)
pca_human_ns <- prcomp(human)
# Creating and printing out a summary of pca_human
s <- summary(pca_human_ns)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# Rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
# Printing out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
# Creating an object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# Drawing a biplot
biplot(pca_human_ns, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], sub = 'Gross National Income per capita (GNI) has the highest variability out of all features and it highly contributes to PC1 component.', cex.sub = 0.7)
options(warn = -1)
# Standardizing the variables
human_std <- scale(human)
# Performing PCA (with the SVD method)
pca_human <- prcomp(human_std)
# Creating and printing out a summary of pca_human
s <- summary(pca_human)
s
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
# Rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
# Printing out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# Creating an object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# Drawing a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], sub = 'Standardized variables contribute more equally to the principal components.', cex.sub = 0.7)
Standardization of the data set makes a very big difference. Without standardizing, very high variability in GNI highly contributes to the first principal component and thus, all variability in the data is explained only with PC1. When data is standardized, all variables contribute more equally to the components and the first principal component explains over 50% of variability in the data.
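The effect is also visible in the loadings of the first principal component; a sketch comparing the two fitted objects from above:

```r
# Without scaling, GNI takes essentially all of the weight in PC1;
# with scaling, the loadings are far more balanced
round(pca_human_ns$rotation[, 1], 3)
round(pca_human$rotation[, 1], 3)
```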
The angles between the feature arrows in the biplot of the standardized data are small, which represents high correlation between those features.
The ratio of female to male labour force participation (Labo.FM) and the percentage of female representatives in parliament (Parli.F) are the only two variables that mainly contribute to the second principal component. These two variables explain less of the variance in the data than the education, life expectancy, maternal mortality and adolescent birth rate variables.
In this section the tea dataset from the package FactoMineR is used. It contains the answers of a questionnaire on tea consumption.
# Loading tea dataset
library(FactoMineR)
data('tea')
# Looking at the structure of the dataset
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
# Looking at the summaries of the variables
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
This dataset has 300 observations and 36 variables. As there are many variables, I will select only some of them for the further analysis: age_Q, sex, price, sugar, frequency and relaxing.
library(dplyr)
library(ggplot2)
library(tidyr)
# Subsetting the dataset
selected_columns <- c ('age_Q', 'sex', 'price', 'sugar', 'frequency', 'relaxing' )
tea_selected <- dplyr::select(tea, one_of(selected_columns))
# Visualize the selected dataset
gather(tea_selected) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
It seems that tea consumption is highest in the 15-24 age group. Cheap tea isn't chosen often. More people find tea drinking relaxing than not, and more females than males drink tea. Tea with no sugar is slightly more preferred.
Let’s do the multiple correspondence analysis.
# Multiple correspondence analysis
mca <- MCA(tea_selected, graph = FALSE)
# Summarizing the model
summary(mca)
##
## Call:
## MCA(X = tea_selected, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.270 0.243 0.211 0.194 0.186 0.178 0.171
## % of var. 10.790 9.727 8.442 7.779 7.434 7.124 6.826
## Cumulative % of var. 10.790 20.517 28.958 36.737 44.171 51.295 58.121
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.165 0.149 0.144 0.142 0.122 0.117 0.113
## % of var. 6.596 5.971 5.767 5.674 4.873 4.660 4.510
## Cumulative % of var. 64.717 70.688 76.455 82.129 87.002 91.662 96.172
## Dim.15
## Variance 0.096
## % of var. 3.828
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.529 0.346 0.046 | 1.139 1.778 0.211 | 0.637 0.640
## 2 | -0.786 0.764 0.336 | 0.476 0.311 0.123 | 0.604 0.576
## 3 | -0.733 0.664 0.351 | -0.281 0.108 0.052 | 0.041 0.003
## 4 | 0.701 0.607 0.320 | -0.026 0.001 0.000 | 0.131 0.027
## 5 | -0.429 0.227 0.111 | 0.093 0.012 0.005 | -0.166 0.043
## 6 | 0.557 0.383 0.090 | -0.093 0.012 0.003 | 0.478 0.360
## 7 | -0.104 0.013 0.003 | 0.331 0.150 0.035 | -0.344 0.187
## 8 | -0.020 0.000 0.000 | -0.266 0.097 0.026 | -0.273 0.118
## 9 | -0.251 0.078 0.024 | 0.125 0.021 0.006 | -0.907 1.300
## 10 | -0.251 0.078 0.024 | 0.125 0.021 0.006 | -0.907 1.300
## cos2
## 1 0.066 |
## 2 0.198 |
## 3 0.001 |
## 4 0.011 |
## 5 0.017 |
## 6 0.066 |
## 7 0.037 |
## 8 0.028 |
## 9 0.318 |
## 10 0.318 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## 15-24 | 0.504 4.818 0.112 5.799 | -0.955 19.160 0.403
## 25-34 | 0.616 5.398 0.113 5.825 | 0.794 9.934 0.188
## 35-44 | -0.175 0.253 0.005 -1.188 | 0.224 0.458 0.008
## 45-59 | -0.990 12.305 0.250 -8.646 | 0.489 3.328 0.061
## +60 | -0.567 2.515 0.047 -3.733 | -0.150 0.196 0.003
## F | -0.385 5.447 0.217 -8.051 | -0.449 8.209 0.295
## M | 0.562 7.948 0.217 8.051 | 0.656 11.977 0.295
## p_branded | 0.031 0.019 0.000 0.363 | -0.283 1.740 0.037
## p_cheap | -0.445 0.285 0.005 -1.189 | -0.881 1.241 0.019
## p_private label | 0.891 3.436 0.060 4.228 | -0.239 0.275 0.004
## v.test Dim.3 ctr cos2 v.test
## 15-24 -10.980 | 0.450 4.896 0.089 5.171 |
## 25-34 7.502 | -0.318 1.839 0.030 -3.007 |
## 35-44 1.519 | -0.245 0.632 0.009 -1.661 |
## 45-59 4.269 | 0.626 6.291 0.100 5.468 |
## +60 -0.989 | -1.258 15.824 0.229 -8.283 |
## F -9.384 | 0.232 2.523 0.079 4.846 |
## M 9.384 | -0.339 3.680 0.079 -4.846 |
## p_branded -3.333 | 0.556 7.726 0.143 6.543 |
## p_cheap -2.355 | 0.358 0.236 0.003 0.957 |
## p_private label -1.136 | 0.823 3.742 0.051 3.903 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## age_Q | 0.409 0.483 0.373 |
## sex | 0.217 0.295 0.079 |
## price | 0.112 0.133 0.544 |
## sugar | 0.509 0.001 0.000 |
## frequency | 0.234 0.273 0.190 |
## relaxing | 0.137 0.274 0.080 |
# Visualizing the MCA
plot(mca, invisible=c("ind"), habillage = "quali")
The horizontal dimension explains 10.79% of the variance in the data, whereas the vertical dimension explains only 9.73%. Together both dimensions explain only around 20% of the variance in the data set.
The MCA graph presents the information in condensed form, where the distance between variable categories gives a measure of their similarity.
Let’s look at some possible conclusions and see if they can be drawn from the data as well.
# Plotting sugar preference by age group
ggplot(tea_selected,
aes(x = age_Q,
fill = sugar )) +
geom_bar(position = position_dodge(preserve = "single")) +labs(
title = 'Sugar preference in different age group') + theme_minimal()
Yes, the barplot shows that more people in the younger age groups prefer tea with sugar, while more people in the older groups prefer tea without sugar.
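The same pattern can be checked numerically; a sketch assuming the tea_selected data frame from above:

```r
# Row-wise proportions of sugar preference within each age group
prop.table(table(tea_selected$age_Q, tea_selected$sugar), margin = 1)
```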
# Plotting tea drinking frequency by sex
ggplot(tea_selected,
aes(x = sex,
fill = frequency )) +
geom_bar(position = position_dodge(preserve = "single")) +labs(
title = 'Tea drinking frequency, female vs male') + theme_minimal()
Seems that this conclusion is visible in the data as well.
# Plotting sugar preference by tea drinking frequency
ggplot(tea_selected,
aes(x = frequency,
fill = sugar )) +
geom_bar(position = position_dodge(preserve = "single")) +labs(
title = 'Sugar preference by drinking frequency') + theme_minimal()
It seems that this conclusion can be drawn from the data as well.
date()
## [1] "Fri Dec 04 01:30:02 2020"
# Loading packages
library(dplyr)
library(tidyr)
library(ggplot2)
# Reading the data set
RATSL <- read.csv('data/RATSL.csv')
# Looking at the structure of the data set
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
# Converting variables as factors
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <-factor(RATSL$Group)
RATSL$WD <-factor(RATSL$WD)
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
Let’s look at the plots of the data.
# Draw the plot:
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
We can see that group 1 has the smallest average weight, while groups 2 and 3 have similar weights. There is a clearly visible outlier in the second group which has a much higher weight than the rest of the rats in that group. Over time the weight of the rats grows.
# Standardizing the rats' weights
RATSL <- RATSL %>% group_by(Time) %>%
mutate(stdweight = (Weight - mean(Weight))/sd(Weight) ) %>% ungroup()
# Looking at the standardized data
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1,...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, ...
## $ WD <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 55...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, ...
## $ stdweight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.881900...
# Plotting again
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized weight")
After standardizing the weights, the growing trend is no longer as visible in the data.
# Setting the number of times
n <- RATSL$Time %>% unique() %>% length()
# Summarizing the data with mean and standard error
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Looking at the data:
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 3...
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375...
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3...
# Plotting the mean profiles:
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.9,0.3)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
# Creating summary data
RATSL_2 <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
# Looking at the data
glimpse(RATSL_2)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9...
# Drawing boxplots
ggplot(RATSL_2, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
scale_y_continuous(name = "mean(Weight)")
The boxplots reveal the outliers of each group; there are three in total. For the first and third groups the outlier is the smallest value, and for the second group it is the largest.
RATSL_2 %>%
group_by(Group) %>%
summarize(max(mean))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## Group `max(mean)`
## <fct> <dbl>
## 1 1 276.
## 2 2 594
## 3 3 542.
RATSL_2 %>%
group_by(Group) %>%
summarize(min(mean))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## Group `min(mean)`
## <fct> <dbl>
## 1 1 239.
## 2 2 444.
## 3 3 495.
So the outlier value is 238.9 for the first group, 594 for the second group, and 495.2 for the third group.
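The outliers could also be picked out programmatically instead of reading them off the group minima and maxima, e.g. with boxplot.stats() (a sketch using the same summary data):

```r
library(dplyr)
# List the values that the boxplot rule flags as outliers in each group
RATSL_2 %>%
  group_by(Group) %>%
  summarize(outliers = list(boxplot.stats(mean)$out))
```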
# Creating a new data by filtering the outliers
RATSL_3 <- RATSL_2 %>%
filter(mean != 238.9 & mean != 495.2 & mean != 594)
# Drawing a boxplot
ggplot(RATSL_3, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
scale_y_continuous(name = "mean(Weight)")
Now the data is free of outliers.
Let’s perform a t-test to see whether groups 1 and 2 differ.
# Filtering data to have group 1 and group 2
RATSL_2groups <- filter(RATSL_3,(Group==1 | Group==2))
# Performing a two-sample t-test
t.test(mean ~ Group, data = RATSL_2groups, var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -44.305, df = 8, p-value = 7.434e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -193.2012 -174.0845
## sample estimates:
## mean in group 1 mean in group 2
## 268.7571 452.4000
Based on the result of the t-test, we see that at the 95% confidence level there is a significant difference between the two means. As the p-value is lower than 0.05, the null hypothesis (that the means are equal) is rejected.
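Note that var.equal = TRUE assumes the two groups have equal variances. Dropping the argument gives Welch's t-test, which relaxes that assumption (a sketch on the same filtered data):

```r
# Welch two-sample t-test (unequal variances allowed)
t.test(mean ~ Group, data = RATSL_2groups)
```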
# Adding the baseline from the original data as a new variable to the summary data
RATS <- read.csv('data/RATS.csv')
RATSL_4 <- RATSL_2 %>%
mutate(baseline = RATS$WD1)
# Fitting the linear model
fit <- lm(mean ~ baseline + Group, data = RATSL_4)
# Computing the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
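Based on the table, the baseline weight is a highly significant predictor of the mean weight (p < 0.001), whereas Group is not significant at the 5% level (p ≈ 0.076) once the baseline is accounted for. The p-values can also be extracted from the table programmatically (a sketch):

```r
# Pull the Pr(>F) column out of the ANOVA table
anova(fit)[["Pr(>F)"]]
```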
# Reading the data set
BPRSL <- read.csv('data/BPRSL.csv')
# Looking at the structure of the data set
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
# Converting variables as factors
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
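A quick check (a sketch) confirms that both variables are now factors:

```r
# Verify the class of the converted variables
sapply(BPRSL[, c("treatment", "subject")], is.factor)
```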